Technical Field

This invention proposes a low-complexity and effective silence detection technique, based on an
intelligent determination of an adaptive threshold value, to enable real-time audio/video
conferencing.

Background of the Invention

Thanks to the recent advances in audio/video compression, processor design, and communication 
network architecture, it is now quite feasible to implement multimedia communication 
applications (e.g., audio/video conferencing) using standard computing and networking facilities. 
This shift of multimedia communication equipment and services from dedicated systems to 
general purpose computers and packet-based communication networks has introduced a quite 
different operating environment and has prompted the reexamination of several key algorithms. 
Silence detection and removal is an essential building block of any multimedia video 
conferencing system. It reduces the bandwidth requirements of the underlying network transport 
service and helps to maintain an acceptable end-to-end delay for audio. 

HomeMeeting Inc. provides a complete Internet service (www.homemeeting.com) for multipoint
multimedia IP communication. To the best of our knowledge, this is the first fully
Internet-based interactive multipoint multimedia WAN communication service with enhanced
quality of service (QoS) and a complete suite of presentation and discussion functions over
narrowband connections (as low as 26.4 Kbps). Every registered member of this service can
sign into the Member Meeting Center from HomeMeeting's website, schedule meetings, invite
meeting participants, and pre-upload documents for online discussion. Multiple microphones
are infeasible for most low-end audio/video conferencing terminals, and very complex signal
processing algorithms demand more computation and introduce longer voice delay. To avoid
both, this invention proposes a low-complexity and effective silence detection technique,
based on an intelligent determination of an adaptive threshold value, to enable real-time
audio/video conferencing.

Prior Art

Silence detection has been studied since digital speech processing research began more than
40 years ago [1]. The use of energy levels and/or zero-crossing rates for silence detection
is satisfactory only at high signal-to-noise ratios. A wide variety of approaches have been
proposed. The simplest compares the signal magnitude with a pre-specified threshold, which
performs poorly in the presence of background noise and varying magnitudes. At the other
extreme are very sophisticated algorithms, such as the use of third-order statistics to
exploit the non-linearity of speech characteristics at the changeovers between speech and
silence [2], which are too complex, particularly for real-time software-based implementation
on general purpose computers.



Based on the short-term energy and zero-crossing measures of speech signals, a low-complexity,
though less effective and less flexible, silence detection algorithm was proposed in [3]. More
specifically, the pre-specified threshold $E_{thresh}$ can be determined as follows:

$$I_1 = 0.03\,(E_{max} - E_{min}) + E_{min}$$

$$I_2 = 4\,E_{min}$$

$$E_{thresh} = 5 \times \min(I_1, I_2)$$

where $E_{max}$ and $E_{min}$ are the maximum and minimum energy values (sum of squared
magnitudes over a certain interval of time, e.g., 10 msec) estimated over the entire speech interval.
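
The threshold rule of [3] can be sketched in a few lines. This is a minimal illustration rather than the implementation of [3]: it assumes the rule reads I_1 = 0.03 (E_max - E_min) + E_min, I_2 = 4 E_min, E_thresh = 5 x min(I_1, I_2), and the function names and the 80-sample interval (10 msec at 8000 samples/sec) are choices made here for the example.

```python
def short_term_energies(samples, frame_len=80):
    """Sum of squared magnitudes over consecutive, non-overlapping intervals.

    frame_len=80 corresponds to 10 msec at 8000 samples/sec (an assumption
    made for this sketch). A trailing partial interval is dropped.
    """
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_threshold(energies):
    """Fixed threshold computed from the extreme energies of the whole utterance."""
    e_max = max(energies)
    e_min = min(energies)
    i1 = 0.03 * (e_max - e_min) + e_min
    i2 = 4.0 * e_min
    return 5.0 * min(i1, i2)
```

Because E_max and E_min are taken over the entire speech interval, the threshold is only available once the whole utterance has been seen, which is one reason such a rule is less flexible for real-time conferencing.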

A somewhat more complex algorithm, adopted in ITU G.729 Annex B [4], uses the degree of
periodicity in signals to determine the presence of voice. However, it is not very effective in a
conference call environment where several people may speak at the same time, and its
computational requirement makes it harder to implement for a real-time application using
low-end hardware devices (such as handheld PDAs). Another attempt, made by IC Tech. Inc. [5],
specifically addresses silence detection in noisy environments, especially when the distance
between the microphone and the user's lips varies. It uses a proprietary voice extraction (VE)
technique that exploits inter-microphone differential information and the statistical
properties of independent signal sources. This technique requires multiple (at least two)
microphones for recording mixtures of sound sources, which are then processed to separate out
a single voice signal of interest from the mixture. For low-end audio/video conferencing
terminals, requiring multiple microphones is not a feasible alternative.

Objects and Advantages 

This invention proposes a low-complexity and effective silence detection technique, based on an
intelligent determination of an adaptive threshold value, to enable real-time audio/video
conferencing. More specifically, the speech signal is low-pass filtered to remove the less
influential high-frequency component, and the DC component is removed as well, so that the
speech magnitude is computed from the most important portion of the uttered speech. Moreover,
through the adaptive threshold determination scheme, the silence detection system updates the
silence threshold value by incorporating the new background signal magnitude, so as to
dynamically distinguish silence from real speech.

Summary of the Invention



Thanks to the recent advances in audio/video compression, processor design, and communication 
network architecture, it is now quite feasible to implement multimedia communication 
applications (e.g., audio/video conferencing) using standard computing and networking facilities. 
This shift of multimedia communication equipment and services from dedicated systems to 
general purpose computers and packet-based communication networks has introduced a quite 
different operating environment and has prompted the reexamination of several key algorithms. 
Silence detection and removal is an essential building block of any multimedia video 
conferencing system. It reduces the bandwidth requirements of the underlying network transport 
service and helps to maintain an acceptable end-to-end delay for audio. 

Multiple microphones are infeasible for most low-end audio/video conferencing terminals, and
very complex signal processing algorithms demand more computation and introduce longer voice
delay. To avoid both, this invention proposes a low-complexity and effective silence detection
technique, based on an intelligent determination of an adaptive threshold value, to enable
real-time audio/video conferencing.

Detailed Description of the Invention 
I. Measuring the Sound Wave Magnitude 

To determine the magnitude of sound waves, the incoming speech data are first separated into
non-overlapping frames for effective processing. Each frame consists of 1200 samples (i.e., 150
msec of speech at an input rate of 8000 samples/sec). Within each frame, the input sound data
s(t) is low-pass filtered to remove the high frequency components.

$$f(0) = s(0) \times 2,$$

$$f(t) = s(t-1) + s(t), \quad 1 \le t < 1200$$
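
As a side note on what the two-tap sum f(t) = s(t-1) + s(t) does to the signal, its frequency response can be worked out directly. The derivation below is a standard calculation and not part of the original description; the symbols H, ω, F, and F_s are introduced here for illustration only.

```latex
% Two-tap sum filter f(t) = s(t-1) + s(t), sampling rate F_s = 8000 samples/sec
\begin{align*}
H(e^{j\omega}) &= 1 + e^{-j\omega} = 2\cos(\omega/2)\, e^{-j\omega/2}, \\
|H(e^{j\omega})| &= 2\,|\cos(\omega/2)|, \qquad \omega = 2\pi F / F_s .
\end{align*}
% F = 0 Hz    : |H| = 2         (DC gain)
% F = 2000 Hz : |H| = \sqrt{2}  (3 dB below the DC gain)
% F = 4000 Hz : |H| = 0         (the Nyquist frequency is cancelled)
```

The boundary value f(0) = 2 s(0) is what the same formula gives if s(-1) is taken to be equal to s(0), so each frame can be filtered without reference to the previous one.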

The DC component is then removed from $f(t)$, and the absolute value is computed for each
sample.

$$g(t) = |f(t) - \bar{f}|, \quad 0 \le t < 1200,$$

where

$$\bar{f} = \frac{1}{1200} \sum_{t=0}^{1199} f(t)$$

The magnitude of the speech signal $\sigma$ in this frame is defined by the equation

$$\sigma = \sum_{t=0}^{1199} |g(t) - m|, \quad \text{where} \quad m = \frac{1}{1200} \sum_{t=0}^{1199} g(t)$$

If $\sigma$ is smaller than a threshold value $\lambda$, this frame is determined to be a silent frame.
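
The per-frame computation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: it reads the steps as a two-tap low-pass sum with f(0) = 2 s(0), removal of the frame mean, rectification, and a sum of absolute deviations of the rectified signal from its own mean. The function names and the use of plain Python lists are choices made here.

```python
FRAME_LEN = 1200  # samples per frame: 150 msec at 8000 samples/sec


def split_frames(samples, frame_len=FRAME_LEN):
    """Cut the input into non-overlapping frames; a trailing partial frame is dropped."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def low_pass(s):
    """f(0) = 2*s(0); f(t) = s(t-1) + s(t) for t >= 1."""
    f = [2 * s[0]]
    f.extend(s[t - 1] + s[t] for t in range(1, len(s)))
    return f


def frame_magnitude(s):
    """Magnitude of one frame: sum over t of |g(t) - m|."""
    n = len(s)
    f = low_pass(s)
    f_mean = sum(f) / n                    # DC component of the filtered frame
    g = [abs(v - f_mean) for v in f]       # DC removed, then rectified
    m = sum(g) / n                         # mean of the rectified signal
    return sum(abs(v - m) for v in g)


def is_silent(s, threshold):
    """A frame is silent when its magnitude is below the threshold."""
    return frame_magnitude(s) < threshold
```

A frame of constant samples, for instance, has magnitude 0 because the DC removal cancels it entirely, so it is classified as silent for any positive threshold.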



II. Determining the Adaptive Threshold Value 

During a conference, the background environment changes over time, and the intensity of the
participants' speech also varies as they move their heads (when a fixed-location microphone
is used). The threshold value $\lambda$ therefore needs to change with the environment. To
update $\lambda$, a value $d$ is computed for 8 consecutive frames.
where




If $d$ is greater than a pre-specified empirical constant $\xi$, then $\lambda$ is not updated.
If $d$ is smaller, the sound is determined to come from the background, and $\lambda$ is
updated as a function of $d$ and $\sigma_{max}$ accordingly:

$$\lambda \leftarrow \lambda + \phi(d, \sigma_{max}),$$

where the function $\phi$ can be any general function. In our current implementation, a relatively
simple function was chosen, i.e.,

$$\lambda \leftarrow \begin{cases} \lambda + \Delta & \text{if } m \times \sigma_{max} > \lambda \\ \lambda - \Delta & \text{if } m \times \sigma_{max} < \lambda - 100 \\ \lambda & \text{else} \end{cases} \qquad \text{if } d < \xi,$$

$$\sigma_{max} = \max_{0 \le i \le 7} \sigma_i,$$

where $\Delta$ is an empirical positive constant, and $m$ is another empirical constant with value
greater than 1.
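
The update rule can be sketched as below. This is a minimal illustration under stated assumptions, not the actual implementation. The value d is taken as an input because its defining equation is not given in this text. The threshold, the step size, the comparison constant, and the scale factor are named `lam`, `delta`, `xi`, and `m` here, and the case d equal to the comparison constant is treated as "no update", which the text leaves open. All numeric values in the usage note are hypothetical.

```python
def update_threshold(lam, d, sigmas, xi, delta, m, margin=100):
    """Return the silence threshold after observing a block of 8 frames.

    lam    -- current threshold
    d      -- value computed over the block (definition not given here)
    sigmas -- the 8 frame magnitudes of the block
    xi     -- empirical constant compared against d
    delta  -- empirical positive step size
    m      -- empirical constant greater than 1
    margin -- the constant 100 in the decrement condition
    """
    if d >= xi:
        # Block is not judged to be background: leave the threshold unchanged.
        return lam
    sigma_max = max(sigmas)
    if m * sigma_max > lam:
        return lam + delta   # background is louder than the threshold allows
    if m * sigma_max < lam - margin:
        return lam - delta   # background is well below the threshold
    return lam               # inside the dead band: no change
```

For example, with `lam=1000`, `xi=10`, `delta=50`, and `m=2`, a background block with `d=1` and all magnitudes equal to 100 lowers the threshold to 950, while the same block with magnitudes of 600 raises it to 1050. The 100-unit dead band keeps the threshold from oscillating when the scaled background level sits just below it.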



Confidential    11/6/01